A companion document for the R/Medicine 2021 Virtual Conference’s Introduction to R for Medical Data workshop.
Welcome to R/Medicine 2021. In this workshop we will be doing a hands-on introduction to R for medical data.
Learn how your organization could use R
Write your first R code!
See some highlights of the R ecosystem
You’ll need the following during the workshop:
A computer with internet connection
An RStudio Cloud account—free
Please note that while you can use a local version of R and RStudio for the workshop, we cannot provide support during the workshop for issues that come up on your local device and software. We use RStudio Cloud because we can do all the technical preparation for participants ahead of time and spend our limited workshop time on learning R as fast as possible.
If you would like to do some additional preparation for the workshop, I recommend learning a bit about Markdown in this ten-minute tutorial.
Introduction ~10 minutes
Systems check ~10 minutes
Learn and Do ~120 minutes
Wrap Up and Discussion ~10 minutes
For this workshop we will be using RStudio Cloud.
Why? Because there is nothing that you need to download!
You don’t need to use RStudio to use R, but I do!
RStudio on your computer and RStudio Cloud look very similar.
R and RStudio are free to download to your computer.
Log into RStudio Cloud
Raise your hand in Zoom if you are having trouble
Once you are in RStudio Cloud, click on the project called “Intro to R for Medical Data”.
Let’s open the file in the file pane (number 4 in the image above) called Intro_to_R_for_medical_data_workshop.Rmd. This is the file we will be working in during the workshop.
Your document will look something like this.
The YAML header contains the special instructions on how to create the output document. We won’t do much with it here today, but it is a very powerful way to make your Rmd file as bespoke as you want it!
Code chunks are where the code will go.
Code chunks have a gray background.
#This is a code chunk!
#Here is a simple calculation
1 + 2
[1] 3
You can run a code chunk by pressing the green play button.
No problem! R can ingest data from lots of locations including from Excel spreadsheets that you might already be very familiar with.
XXX smoke data from XXX with data dictionary here XXX put in link.
Run the code chunk below by pressing that green play button.
smoke_complete <- read_excel("./data/smoke_complete.xlsx")
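If you want to try read_excel() outside the workshop project, here is a minimal sketch. It assumes the readxl package is installed and uses datasets.xlsx, an example workbook that ships with readxl, rather than smoke_complete.xlsx. The names example_path and iris_from_excel are only illustrative.

```r
library(readxl)

# Path to an example workbook bundled with readxl (not the workshop data)
example_path <- readxl_example("datasets.xlsx")

# List the worksheets in the workbook before choosing one
excel_sheets(example_path)

# Read a single sheet into a data frame (a tibble)
iris_from_excel <- read_excel(example_path, sheet = "iris")

# Check the size of what was read: rows, then columns
dim(iris_from_excel)
```

Listing the sheets first is a handy habit, because read_excel() reads only the first sheet unless you name another one.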
One thing that Excel does well is to provide an interactive visual representation of the data. This allows you to inspect it by sorting and filtering. RStudio actually does this well, too, with one difference - it won’t let you change any of the data while you inspect it.
Look on the right at the Environment pane (you might have to click the icon that looks like a spreadsheet on the “Environment” tab) and find the entry smoke_complete. This is the data frame you just created inside of R’s memory. (If you don’t see smoke_complete, try running the code chunk above again.)
Within the Environment pane, click on the name smoke_complete (not the blue arrow) to view the data.
Go ahead and try to edit one of the values in this viewer. You will find that you can’t. It would have been easy for the RStudio programmers to allow editing of specific values, but they decided not to add that feature.
Next we will use the glimpse() function to learn about our data.
glimpse(smoke_complete)
Rows: 1,152
Columns: 20
$ primary_diagnosis <chr> "C34.1", "C34.1", "C34.3", "C...
$ tumor_stage <chr> "stage ia", "stage ib", "stag...
$ age_at_diagnosis <dbl> 24477, 26615, 28171, 27154, 2...
$ vital_status <chr> "dead", "dead", "dead", "aliv...
$ morphology <chr> "8070/3", "8070/3", "8070/3",...
$ days_to_death <chr> "371", "136", "2304", "NA", "...
$ state <chr> "live", "live", "live", "live...
$ tissue_or_organ_of_origin <chr> "C34.1", "C34.1", "C34.3", "C...
$ days_to_birth <dbl> -24477, -26615, -28171, -2715...
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C...
$ days_to_last_follow_up <chr> "NA", "NA", "2099", "3747", "...
$ cigarettes_per_day <dbl> 10.9589041, 2.1917808, 1.6438...
$ years_smoked <chr> "NA", "NA", "NA", "NA", "NA",...
$ gender <chr> "male", "male", "female", "ma...
$ year_of_birth <chr> "1936", "1931", "1927", "1930...
$ race <chr> "white", "asian", "white", "w...
$ ethnicity <chr> "not hispanic or latino", "no...
$ year_of_death <chr> "2004", "2003", "NA", "NA", "...
$ bcr_patient_barcode <chr> "TCGA-18-3406", "TCGA-18-3407...
$ disease <chr> "LUSC", "LUSC", "LUSC", "LUSC...
Next, we will use the skim() function to learn even more about our data.
Press the green play button in the code chunk below.
Look at the mini histograms at the bottom of the output. What can you learn very quickly about the distribution of the cigarettes_per_day variable?
skim(smoke_complete)
| Name | smoke_complete |
| Number of rows | 1152 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| primary_diagnosis | 0 | 1 | 5 | 6 | 0 | 17 | 0 |
| tumor_stage | 0 | 1 | 7 | 12 | 0 | 11 | 0 |
| vital_status | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
| morphology | 0 | 1 | 6 | 6 | 0 | 12 | 0 |
| days_to_death | 0 | 1 | 1 | 4 | 0 | 279 | 0 |
| state | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
| tissue_or_organ_of_origin | 0 | 1 | 5 | 5 | 0 | 16 | 0 |
| site_of_resection_or_biopsy | 0 | 1 | 5 | 5 | 0 | 16 | 0 |
| days_to_last_follow_up | 0 | 1 | 1 | 4 | 0 | 459 | 0 |
| years_smoked | 0 | 1 | 1 | 2 | 0 | 48 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| year_of_birth | 0 | 1 | 2 | 4 | 0 | 67 | 0 |
| race | 0 | 1 | 5 | 41 | 0 | 6 | 0 |
| ethnicity | 0 | 1 | 12 | 22 | 0 | 3 | 0 |
| year_of_death | 0 | 1 | 2 | 4 | 0 | 21 | 0 |
| bcr_patient_barcode | 0 | 1 | 12 | 12 | 0 | 734 | 0 |
| disease | 0 | 1 | 4 | 4 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age_at_diagnosis | 0 | 1 | 24175.38 | 3926.13 | 7855.00 | 22068.75 | 24750.50 | 26927.00 | 32872 | ▁▁▃▇▂ |
| days_to_birth | 0 | 1 | -24175.38 | 3926.13 | -32872.00 | -26927.00 | -24750.50 | -22068.75 | -7855 | ▂▇▃▁▁ |
| cigarettes_per_day | 0 | 1 | 2.61 | 2.04 | 0.01 | 1.37 | 2.19 | 3.29 | 40 | ▇▁▁▁▁ |
Now we will take the data we have loaded and make some plots!
Produce a histogram from smoke_complete using geom_histogram(), mapping cigarettes_per_day to the x aesthetic. Save this plot to the variable plot_1 and show it.
plot_1 <- ggplot(data = smoke_complete,
aes(x = cigarettes_per_day)) +
geom_histogram()
plot_1
We’ve made a plot, but there are some issues with it:
Let’s make our plot a bit prettier.
Use labs() to set the title and the axis labels, and add theme_classic() to the plot.
plot_2 <- ggplot(data = smoke_complete,
aes(x = cigarettes_per_day)) +
geom_histogram() +
labs(title = "An excellent title for your plot",
x = "Cigarettes Per Day",
y = "Count") +
theme_classic()
plot_2
After running that code you got a message: “stat_bin() using bins = 30. Pick better value with binwidth.” It means ggplot2 fell back to its default of 30 bins because we did not choose a bin width ourselves.
Let’s modify the width of our histogram’s bins in the code chunk below to 2. Then try 10.
plot_3 <- ggplot(data = smoke_complete,
aes(x = cigarettes_per_day)) +
geom_histogram(binwidth = 4) +
labs(title = "An excellent title for your plot",
x = "Cigarettes Per Day",
y = "Count") +
theme_classic()
plot_3
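To see what binwidth actually changes, here is a small sketch on made-up data. The toy_smoke data frame and its values are invented for illustration and are not taken from smoke_complete. It uses layer_data() from ggplot2, which returns the bins that geom_histogram() computed, so you can compare the two settings without judging the picture by eye.

```r
library(ggplot2)

# Invented values standing in for cigarettes_per_day
toy_smoke <- data.frame(
  cigarettes_per_day = c(0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 6, 9, 12, 20)
)

narrow_bins <- ggplot(toy_smoke, aes(x = cigarettes_per_day)) +
  geom_histogram(binwidth = 2)

wide_bins <- ggplot(toy_smoke, aes(x = cigarettes_per_day)) +
  geom_histogram(binwidth = 10)

# layer_data() exposes the computed bins, one row per bin
narrow_data <- layer_data(narrow_bins)
wide_data <- layer_data(wide_bins)

# A smaller binwidth gives more, narrower bars
nrow(narrow_data)
nrow(wide_data)

# Every observation still lands in exactly one bin either way
sum(narrow_data$count)
sum(wide_data$count)
```

Narrow bins show more detail but can look noisy; wide bins are smoother but can hide features such as a long right tail.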
Finally let’s explore the differences in daily cigarettes for the gender variable by adding a fill color and facet_grid.
Below add gender to the facet.
plot_4 <- ggplot(data = smoke_complete,
aes(x = cigarettes_per_day,
fill = gender)) +
geom_histogram(binwidth = 2) +
labs(title = "An excellent title for your plot",
x = "Cigarettes Per Day",
y = "Count",
fill = "Gender") +
theme_classic() +
facet_grid(. ~ gender)
plot_4
ggplot2: A Grammar of Graphics
ggplot2 is an extremely powerful software library for visualization.
The gg is short for Grammar of Graphics, which means that different types of visualizations are expressed in a specific and consistent way throughout the package. There is a wonderful universe of complementary packages in R that extend ggplot2 to many different types of visualizations.
Here’s a visual summary of the different parts we’re talking about today. There are many parts to a visualization, but many of us don’t have the words to describe them. Here we will spend some time breaking down the meaning of the different constituent parts.
Note: there are different ways to write ggplot code to get the same output, a bit like formal versus casual conversation. Here we will focus on the more formal style because it is the most explicit, although it is also more verbose! As you get familiar with ggplot, you will likely shorten your code a bit at times from this very explicit method. We won’t cover every layer in depth, but know that breaking down visualizations into these constituents allows for tremendous control.
Image from Thomas Lin Pedersen’s ggplot2 workshop
ggplot2 code
A ggplot2 graphic consists of a mapping of variables in data to the aes()thetic attributes of geom_etric objects.
In code, this is translated as:
# start the plot object with ggplot()
# assign smoke_complete to the data argument
ggplot(data = smoke_complete,
# map the variables to visual properties of the graph
mapping = aes(
# map the x-axis to age_at_diagnosis
x = age_at_diagnosis,
# map the y-axis to cigarettes_per_day
y = cigarettes_per_day,
# map the color to the disease variable
color = disease
)) +
# add the geometry and the alpha
geom_point(alpha = 0.2) + # complete the geom to geom_point
# add labels to your plot
labs(title = "Age of diagnosis of cancer by daily cigarette consumption", # plot title
x="Age in days", # x-axis label
y="Cigarettes per day", # y-axis label
color = "Disease type") + # label for color legend
# add a facet to your plot
facet_grid(cols = vars(gender)) + # facet by gender variable
# add a theme to your plot
theme_bw()
Things to note: we chain these things together with + (plus sign).
Set the data argument to smoke_complete, then run the code chunk.
ggplot(data = smoke_complete)
Add mapping arguments to the aes function. Assign the variable age_at_diagnosis to x, cigarettes_per_day to y, and disease to color.
Run the code chunk. How does it look different from the prior code chunk?
ggplot(data = smoke_complete, mapping = aes(x = age_at_diagnosis,
y = cigarettes_per_day,
color = disease))
Add geom_point().
ggplot(data = smoke_complete, mapping = aes(x = age_at_diagnosis,
y = cigarettes_per_day,
color = disease)) +
geom_point()
Ok, wow, looking great!! Do you notice any major differences between this graph and the one produced by the full_example code chunk above, before we started the step-by-step process?
In the next modification, modify the alpha argument to vary the opacity of the points. You can vary the value from 0 to 1. Try a few different values to see the difference.
ggplot(data = smoke_complete, mapping = aes(x = age_at_diagnosis,
y = cigarettes_per_day,
color = disease)) +
geom_point(alpha = 0.2)
plot_5 <- ggplot(data = smoke_complete, mapping = aes(x = age_at_diagnosis,
y = cigarettes_per_day,
color = disease)) +
geom_point(alpha = 0.2) +
labs(title = " ",
x= " ",
y= " ",
color = " ")
plot_5
Wait! What little trick was that? See the variable plot_5 in the code chunk above? That is storing your whole code with your nice new labels. Instead of continuing to rewrite your code, you can just modify as needed. Below we are going to facet by gender to separate out male and female to inspect the differences.
plot_5 +
facet_grid(cols = vars(gender))
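Here is a small sketch of the same trick on made-up data. The toy_patients data frame and the object names are invented for illustration. It shows two things: adding a layer to a stored plot returns a new plot and leaves the stored one untouched, and the two facet notations used in this document, cols = vars(gender) and . ~ gender, lay out the same panels.

```r
library(ggplot2)

# Invented data with a two-level grouping column
toy_patients <- data.frame(
  age_years = c(52, 61, 67, 70, 74, 78),
  cigarettes_per_day = c(1.2, 2.5, 3.1, 0.8, 2.0, 4.4),
  gender = c("female", "male", "female", "male", "female", "male")
)

# Store the base plot once
base_plot <- ggplot(toy_patients, aes(x = age_years, y = cigarettes_per_day)) +
  geom_point()

# Adding a facet returns a new plot; base_plot itself is unchanged
facet_vars <- base_plot + facet_grid(cols = vars(gender))
facet_formula <- base_plot + facet_grid(. ~ gender)

# Count the panels each plot draws
length(unique(layer_data(base_plot)$PANEL))
length(unique(layer_data(facet_vars)$PANEL))
length(unique(layer_data(facet_formula)$PANEL))
```

Because base_plot is never modified, you can branch off as many variations as you like from the same starting point.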
Finally we will play a bit with different themes. Pick two of the themes below, add them to plot_5, and run the code chunk to compare them.
Theme options:
There are plenty of out-of-the-box themes in ggplot2, but you can make your own theme, use other people’s, or even use a theme your organization might already have. Let’s load the ggthemes library.
library(ggthemes)
plot_5 + theme_economist() + labs(subtitle = "The Economist Style")
plot_5 + theme_tufte() + labs(subtitle = "Tufte Style")
Rarely is your data going to be in the form you need for analysis and plotting. You will often need to wrangle your data and change its shape a bit.
Let’s briefly discuss the different ways we might need to process data.
Discussion
There are different packages and ways people wrangle data with R, but we’re going to demonstrate using some packages from the {tidyverse}, which is a whole ecosystem of R packages organized around having tidy data.
From our smoke_complete data, let’s select two columns to keep: gender and days_to_death.
select(smoke_complete, gender, days_to_death) # select columns to keep
# A tibble: 1,152 x 2
gender days_to_death
<chr> <chr>
1 male 371
2 male 136
3 female 2304
4 male NA
5 female NA
6 male 345
7 male 716
8 male 2803
9 male 973
10 male 1097
# ... with 1,142 more rows
There is a lovely package called magrittr, which was loaded earlier, that provides something called a pipe. It looks like this in code: %>% (pronounced “then”). It lets us pass data from one function to the next and perform lots of tasks on it in a readable sequence.
Run the code below and inspect the output
smoke_complete %>% select(gender, days_to_death)
# A tibble: 1,152 x 2
gender days_to_death
<chr> <chr>
1 male 371
2 male 136
3 female 2304
4 male NA
5 female NA
6 male 345
7 male 716
8 male 2803
9 male 973
10 male 1097
# ... with 1,142 more rows
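To convince yourself that the pipe is only a different way of writing the same call, here is a minimal sketch on made-up data. The toy_smoke tibble and the object names are invented for illustration; dplyr supplies tibble(), select(), and the %>% pipe used here.

```r
library(dplyr)

# Invented rows with the same column names as the workshop data
toy_smoke <- tibble(
  gender = c("male", "female", "male"),
  days_to_death = c("371", "NA", "136"),
  disease = c("LUSC", "LUSC", "BLCA")
)

# The data frame is the first argument of select() ...
nested_call <- select(toy_smoke, gender, days_to_death)

# ... and the pipe simply places the left-hand side into that first argument
piped_call <- toy_smoke %>% select(gender, days_to_death)

# Both routes give exactly the same result
identical(nested_call, piped_call)
```

The piped form pays off once several steps are chained, because the code then reads top to bottom in the order the steps happen.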
Now let’s meet filter().
filter(smoke_complete, bcr_patient_barcode == "TCGA-18-3412") # filter to specific rows
# A tibble: 2 x 20
primary_diagnos~ tumor_stage age_at_diagnosis vital_status
<chr> <chr> <dbl> <chr>
1 C34.3 stage ib 19025 dead
2 C34.3 stage ib 19025 dead
# ... with 16 more variables: morphology <chr>, days_to_death <chr>,
# state <chr>, tissue_or_organ_of_origin <chr>,
# days_to_birth <dbl>, site_of_resection_or_biopsy <chr>,
# days_to_last_follow_up <chr>, cigarettes_per_day <dbl>,
# years_smoked <chr>, gender <chr>, year_of_birth <chr>,
# race <chr>, ethnicity <chr>, year_of_death <chr>,
# bcr_patient_barcode <chr>, disease <chr>
The same filter, written with the pipe:
smoke_complete %>% filter(bcr_patient_barcode == "TCGA-18-3412")
# A tibble: 2 x 20
primary_diagnos~ tumor_stage age_at_diagnosis vital_status
<chr> <chr> <dbl> <chr>
1 C34.3 stage ib 19025 dead
2 C34.3 stage ib 19025 dead
# ... with 16 more variables: morphology <chr>, days_to_death <chr>,
# state <chr>, tissue_or_organ_of_origin <chr>,
# days_to_birth <dbl>, site_of_resection_or_biopsy <chr>,
# days_to_last_follow_up <chr>, cigarettes_per_day <dbl>,
# years_smoked <chr>, gender <chr>, year_of_birth <chr>,
# race <chr>, ethnicity <chr>, year_of_death <chr>,
# bcr_patient_barcode <chr>, disease <chr>
Together select() and filter() will be big workhorses in your data wrangling toolkit. However, at first it is easy to confuse which function does what.
select() chooses the columns to keep.
filter() filters the rows to keep that meet certain conditions placed on columns. So in the example above, we filtered the data to only keep the rows where the bcr_patient_barcode was equal to “TCGA-18-3412”.
We met the pipe %>% earlier, but let’s see how we can combine multiple pipes in a row to get a more complicated operation done.
# A tibble: 1,032 x 2
age_at_diagnosis gender
<dbl> <chr>
1 24477 male
2 26615 male
3 28171 female
4 27154 male
5 23370 female
6 19025 male
7 26938 male
8 28430 male
9 30435 male
10 24019 male
# ... with 1,022 more rows
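As an illustration of chaining, here is a minimal sketch on made-up data. It is not the chunk that produced the output above: the toy_smoke tibble and the filter condition are assumptions chosen only to show the pattern of one pipe feeding the next.

```r
library(dplyr)

# Invented rows with a few of the workshop's column names
toy_smoke <- tibble(
  age_at_diagnosis = c(24477, 26615, 28171, 19025),
  gender = c("male", "male", "female", "male"),
  vital_status = c("dead", "dead", "alive", "alive")
)

toy_result <- toy_smoke %>%
  filter(vital_status == "dead") %>%   # first keep the rows that meet a condition
  select(age_at_diagnosis, gender)     # then keep only two columns

toy_result
```

Order matters here: the filter uses vital_status, so it has to run before select() drops that column.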
Let’s make age_at_diagnosis more human friendly by creating a new column that divides the days by 365.25. (There are great R packages to handle dates, times, durations, intervals, and all those other messy time and date issues!)
mutate also lets you create multiple “mutations” (e.g., calculations) at the same time.
table_1 <- smoke_complete %>%
mutate(age_at_diagnosis_years = age_at_diagnosis/365.25,
age_at_diagnosis_years_round = round(age_at_diagnosis_years, 1)) %>%
select(age_at_diagnosis, age_at_diagnosis_years, age_at_diagnosis_years_round)
table_1
# A tibble: 1,152 x 3
age_at_diagnosis age_at_diagnosis_years age_at_diagnosis_years_rou~
<dbl> <dbl> <dbl>
1 24477 67.0 67
2 26615 72.9 72.9
3 28171 77.1 77.1
4 27154 74.3 74.3
5 23370 64.0 64
6 19025 52.1 52.1
7 26938 73.8 73.8
8 28430 77.8 77.8
9 30435 83.3 83.3
10 24019 65.8 65.8
# ... with 1,142 more rows
Let’s see how many unique values there are in each column of our dataset. The skim function from earlier will report the number of unique values in a column in the n_unique column.
skim(smoke_complete)
| Name | smoke_complete |
| Number of rows | 1152 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| primary_diagnosis | 0 | 1 | 5 | 6 | 0 | 17 | 0 |
| tumor_stage | 0 | 1 | 7 | 12 | 0 | 11 | 0 |
| vital_status | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
| morphology | 0 | 1 | 6 | 6 | 0 | 12 | 0 |
| days_to_death | 0 | 1 | 1 | 4 | 0 | 279 | 0 |
| state | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
| tissue_or_organ_of_origin | 0 | 1 | 5 | 5 | 0 | 16 | 0 |
| site_of_resection_or_biopsy | 0 | 1 | 5 | 5 | 0 | 16 | 0 |
| days_to_last_follow_up | 0 | 1 | 1 | 4 | 0 | 459 | 0 |
| years_smoked | 0 | 1 | 1 | 2 | 0 | 48 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| year_of_birth | 0 | 1 | 2 | 4 | 0 | 67 | 0 |
| race | 0 | 1 | 5 | 41 | 0 | 6 | 0 |
| ethnicity | 0 | 1 | 12 | 22 | 0 | 3 | 0 |
| year_of_death | 0 | 1 | 2 | 4 | 0 | 21 | 0 |
| bcr_patient_barcode | 0 | 1 | 12 | 12 | 0 | 734 | 0 |
| disease | 0 | 1 | 4 | 4 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age_at_diagnosis | 0 | 1 | 24175.38 | 3926.13 | 7855.00 | 22068.75 | 24750.50 | 26927.00 | 32872 | ▁▁▃▇▂ |
| days_to_birth | 0 | 1 | -24175.38 | 3926.13 | -32872.00 | -26927.00 | -24750.50 | -22068.75 | -7855 | ▂▇▃▁▁ |
| cigarettes_per_day | 0 | 1 | 2.61 | 2.04 | 0.01 | 1.37 | 2.19 | 3.29 | 40 | ▇▁▁▁▁ |
Usually we want to get counts from a categorical variable, so when there are 100s of unique values, that is typically a column that we wouldn’t be counting.
Here we can use the count function to count the number of observations for each unique value of the vital_status column.
smoke_complete %>%
count(vital_status)
# A tibble: 2 x 2
vital_status n
* <chr> <int>
1 alive 647
2 dead 505
If we pass in multiple columns, it will give us a count for each combination of column values
smoke_complete %>%
count(vital_status, disease)
# A tibble: 6 x 3
vital_status disease n
<chr> <chr> <int>
1 alive BLCA 121
2 alive CESC 66
3 alive LUSC 460
4 dead BLCA 103
5 dead CESC 26
6 dead LUSC 376
If you want more of a “2x2” table of results, you can use the base-R table function. We can pass in each column of values from our data using the $ notation
table(smoke_complete$vital_status, smoke_complete$disease)
BLCA CESC LUSC
alive 121 66 460
dead 103 26 376
We can also use addmargins() if we want to see row and column sums in our count table.
table(smoke_complete$vital_status, smoke_complete$disease) %>%
addmargins()
BLCA CESC LUSC Sum
alive 121 66 460 647
dead 103 26 376 505
Sum 224 92 836 1152
There are several great packages that can make lovely tables that are publication ready.
Here is one quick example using table_1, the output of our earlier code chunk. We take that same object and let a package called kableExtra work some magic on it!
table_1 %>%
head(10) %>% # print only the first 10 rows
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| age_at_diagnosis | age_at_diagnosis_years | age_at_diagnosis_years_round |
|---|---|---|
| 24477 | 67.01437 | 67.0 |
| 26615 | 72.86790 | 72.9 |
| 28171 | 77.12799 | 77.1 |
| 27154 | 74.34360 | 74.3 |
| 23370 | 63.98357 | 64.0 |
| 19025 | 52.08761 | 52.1 |
| 26938 | 73.75222 | 73.8 |
| 28430 | 77.83710 | 77.8 |
| 30435 | 83.32649 | 83.3 |
| 24019 | 65.76044 | 65.8 |
R Markdown documents, like this document, allow you to place text and analysis with the code all in a single document and output the result into different formats such as an HTML webpage, a PDF, or even a Word document.
Let’s quickly knit a document to see the output. In pane 4, the file pane, find the document called knit_preview.Rmd and open it. We will see examples of knitting the same .Rmd file to a Word document, a PDF, and HTML.
Within R, you can type ? followed by the name of a function in the console, e.g. ?filter.
In RStudio, you can look at the help tab in pane 4.
However, in the end your best resource is probably going to be searching online for the issue. From online forums, to blog posts, to Twitter threads, there is a ton of content out there, but crafting a good search query is a workshop in itself!
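A short sketch of the built-in help lookups follows. One detail worth knowing: more than one loaded package can define a function with the same name (both stats and dplyr have a filter), so it can help to name the package. The object name mean_help is only illustrative.

```r
# Type these in the console; in RStudio each one opens a page in the Help tab.
# ?mean                               # help for a base R function
# ?dplyr::filter                      # name the package when several define the function
# help("filter", package = "dplyr")   # the same lookup written as a function call

# help() returns an object describing where the page was found,
# so the result can be stored and inspected without opening anything.
mean_help <- help("mean", package = "base")

# At least one help page was found for "mean" in base R
length(mean_help) >= 1
```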
clean_names()
The last package we’ll talk about today is the janitor package. This package adds some really useful functions for cleaning data.
The one I use the most is called clean_names(). This will convert column names into a standard format: all letters will be lowercase and spaces will be converted to underscores.
The smoke_complete dataset has already had this function applied to it, but let’s create a smaller example dataset where this wouldn’t be the case.
# A tibble: 1,152 x 4
`primary diagnosis` `tumor Stage` `Age at Diagnosis` vitalSTATUS
<chr> <chr> <dbl> <chr>
1 C34.1 stage ia 24477 dead
2 C34.1 stage ib 26615 dead
3 C34.3 stage ib 28171 dead
4 C34.1 stage ia 27154 alive
5 C34.1 stage iiia 23370 alive
6 C34.3 stage ib 19025 dead
7 C34.3 stage iv 26938 dead
8 C34.1 stage ib 28430 dead
9 C34.1 stage iib 30435 dead
10 C34.9 stage iv 24019 dead
# ... with 1,142 more rows
Ack! These column names have spaces and inconsistent capitalization, which make them difficult to work with.
So let’s use clean_names() to fix this:
library(janitor)
smoke_good_colnames <- smoke_bad_colnames %>%
clean_names()
smoke_good_colnames
# A tibble: 1,152 x 4
primary_diagnosis tumor_stage age_at_diagnosis vital_status
<chr> <chr> <dbl> <chr>
1 C34.1 stage ia 24477 dead
2 C34.1 stage ib 26615 dead
3 C34.3 stage ib 28171 dead
4 C34.1 stage ia 27154 alive
5 C34.1 stage iiia 23370 alive
6 C34.3 stage ib 19025 dead
7 C34.3 stage iv 26938 dead
8 C34.1 stage ib 28430 dead
9 C34.1 stage iib 30435 dead
10 C34.9 stage iv 24019 dead
# ... with 1,142 more rows
By default, clean_names() converts to snake case (spaces between words are replaced by an underscore _). clean_names() will help you standardize your dataset’s column names. Knowing that everything follows a certain convention will make your data cleaning journey better.
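If you want to try clean_names() on something small, here is a minimal sketch. The messy data frame is invented for illustration and only borrows the four untidy column names shown above; it assumes the janitor package is installed.

```r
library(janitor)

# Invented data frame with the same kinds of messy names shown above
messy <- data.frame(
  `primary diagnosis` = c("C34.1", "C34.3"),
  `tumor Stage` = c("stage ia", "stage ib"),
  `Age at Diagnosis` = c(24477, 28171),
  vitalSTATUS = c("dead", "alive"),
  check.names = FALSE  # keep the spaces instead of letting data.frame() alter them
)

# Names before cleaning
names(messy)

# clean_names() changes only the column names, never the values
tidy <- clean_names(messy)
names(tidy)
```

Running clean_names() right after importing a file means every later step can rely on the same naming convention.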
ggplot2: Elegant Graphics for Data Analysis
The list is always growing! The trouble isn’t finding good resources, it’s finding time to read them!
This work was made possible by the distill, ggplot2, dplyr, and rmarkdown packages.